In this workshop we will automate the process of pulling data from a web page.
Take a look at this web page from OpenStreetMap:
Underground
There is a list of underground stations together with some useful information, such as the geo-coordinates of the stations. If we wanted to use this data in a program (e.g. to draw a map of the London Underground), we would need to get this data into a more useful form, such as a list in Python or a JSON object in a JavaScript program.
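For example, in JavaScript we might want the data to end up as an array of objects, something like this (the station names and coordinates below are made up purely to illustrate the target shape):

const stations = [
    { name: "Baker Street", lat: 51.5226, lon: -0.1571 },
    { name: "Bank", lat: 51.5133, lon: -0.0886 }
    // ...one entry per station
];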
A really simple approach would be to select the table on the page using your mouse, copy it, and paste it into your favourite spreadsheet program. This will work, but has a few significant disadvantages:
Here is a more automated, code-free approach:
Here is an HTML to JSON tool which you can use for this process:
HTML to JSON
The above approach works to some degree, but it is not fully automated. We still need to do some copying and pasting.
The rest of this workshop explores a coded approach, using Node.js.
The source code for the examples in this worksheet can be found here:
Source Code
Before we get started, download the code and open it up in Visual Studio Code.
Let's do a simple web-scraping exercise. We will scrape this test page:
Test Page
The test page is a simple list of countries and their capital cities (the full HTML is shown later in this worksheet).
We will be using a Node.js module called Cheerio to help us.
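Cheerio, along with axios (which we will use to fetch the page), is installed from npm. The downloaded code should already include everything you need, but if the dependencies are missing you can install them from the project folder with:

npm install axios cheerio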
First, let's run this program. Open up a Terminal window in Visual Studio Code (menu Terminal > New Terminal). Type in the following command:
node scrape1-simple.js
You should see the following message:
Found 8 rows
See data.csv for results
You should also see a file called data.csv containing the following:
country
Albania
Armenia
Austria
Azerbaijan
Belarus
Belgium
Bosnia and Herzegovina
Bulgaria
Open the code file scrape1-simple.js and take a look at it:
// Load the modules we need
const axios = require('axios');                    // for sending web requests
const cheerio = require('cheerio');                // for web scraping
const scrape_helper = require("./scrape-helper");  // for saving objects to csv

// Call the scrape function
scrape();

// Function to scrape a page
function scrape() {
    // Specify the URL of the page we want to scrape
    let url = "https://www.thinkcreatelearn.co.uk/resources/node/web-scraping/sample1.html";

    // Make the http request to the URL to get the data
    axios.get(url).then(response => {
        // Get the data from the response
        const data = response.data;

        // Load the HTML into the Cheerio web scraper
        const $ = cheerio.load(data);

        // Create a list to receive the data we will scrape
        const results = [];

        // Create a new csv file
        scrape_helper.initialiseCsv('data.csv');

        // Search for the elements we want
        const selection = $('h2');

        // Add the elements to the list
        selection.each((i, el) => {
            const text = $(el).text();
            results.push({country: text});
        });

        // Report how many rows we found
        console.log("Found " + results.length + " rows");

        // Save the data to the csv
        scrape_helper.storeCsv('data.csv', results);
        console.log("See data.csv for results");
    }).catch((err) => {
        // Show any error message
        console.log("Error: " + err.message);
    });
}
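The scrape-helper module is not listed in this worksheet, but from the way it is used we can guess its shape. Here is a minimal sketch of what it might look like, assuming initialiseCsv simply creates an empty file and storeCsv appends a header row (taken from the object keys) followed by one line per object; the real module may well do more:

// scrape-helper.js -- a minimal sketch only; the real module may differ
const fs = require('fs');

// Create the csv file (or empty it if it already exists)
function initialiseCsv(filename) {
    fs.writeFileSync(filename, '');
}

// Append a header row taken from the keys of the first object,
// followed by one row per object
// (note: values containing commas would need quoting in real csv)
function storeCsv(filename, results) {
    if (results.length === 0) return;
    const header = Object.keys(results[0]).join(',');
    const rows = results.map(obj => Object.values(obj).join(','));
    fs.appendFileSync(filename, header + '\n' + rows.join('\n') + '\n');
}

module.exports = { initialiseCsv, storeCsv };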
Essentially what the code is doing is searching for particular elements in the HTML and using them to build up a list of JavaScript objects. Take a look at the HTML of the web page we are scraping (in Chrome, visit the Test Page, then right-click on the page and select View Page Source).
Here's the HTML code for this page. Note the <h2> tags, as these are the elements our code searches for:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Web scraping exercise</title>
    <style>
        .fpar {
            font-family: Georgia;
        }
    </style>
</head>
<body>
    <p>Sample for web scraping exercises</p>
    <h1>Countries beginning with A</h1>
    <h2>Albania</h2>
    <h3>Tirana</h3>
    <h2>Armenia</h2>
    <h3>Yerevan</h3>
    <h2>Austria</h2>
    <h3>Vienna</h3>
    <h2>Azerbaijan</h2>
    <h3>Baku</h3>
    <h1 id="b-countries">Countries beginning with B</h1>
    <h2>Belarus</h2>
    <h3>Minsk</h3>
    <h2>Belgium</h2>
    <h3>Brussels</h3>
    <h2>Bosnia and Herzegovina</h2>
    <h3>Sarajevo</h3>
    <h2>Bulgaria</h2>
    <h3>Sofia</h3>
</body>
</html>
The web scraping code is looking for all the <h2> tags:
// Search for the elements we want
const selection = $('h2')
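The argument to $() is a standard CSS selector, so you are not limited to plain tag names. For example, using the HTML above, you could select the single heading with the id b-countries, or all the <h3> capital cities:

const bHeading = $('#b-countries');   // the <h1> with id="b-countries"
const capitals = $('h3');             // every <h3> element on the page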
Then it builds up the list, adding one object for each <h2> tag:
// Add the elements to the list
selection.each((i, el) => {
    const text = $(el).text();
    results.push({country: text});
});
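Once the loop has run, results holds one object per country. Based on the csv output we saw earlier, it looks like this:

[
    { country: 'Albania' },
    { country: 'Armenia' },
    { country: 'Austria' },
    { country: 'Azerbaijan' },
    { country: 'Belarus' },
    { country: 'Belgium' },
    { country: 'Bosnia and Herzegovina' },
    { country: 'Bulgaria' }
]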
Finally, it saves the results to a csv file:
// Save the data to the csv
scrape_helper.storeCsv('data.csv', results)
console.log("See data.csv for results")
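As a variation (this is a sketch, not part of scrape1-simple.js), you could capture each country's capital as well. In this page every <h2> is immediately followed by an <h3>, so Cheerio's next() method can pick it up:

// Build rows like { country: 'Albania', capital: 'Tirana' }
selection.each((i, el) => {
    const country = $(el).text();
    const capital = $(el).next('h3').text();   // the <h3> directly after this <h2>
    results.push({country: country, capital: capital});
});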